4 research outputs found
A summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.
Techniques for two-stage open vocabulary spoken term detection and verification
Spoken term detection (STD) is one of many applications that require a capability for search and retrieval of spoken content from large media repositories. In a typical STD scenario, a user enters a query term consisting of a word or phrase and, in response, the search engine returns a list of detected occurrences of the query term in the repository. The state-of-the-art STD systems use an automatic speech recognition (ASR) system for generating a tokenized representation of the speech and perform search on this representation to find hypothesized occurrences of the query terms. Varying acoustic conditions, speaker populations, and speaking styles, along with specialized task domains, all contribute to generally poor speech recognition performance in many STD scenarios. Furthermore, the size of media repositories can be extremely large, in some cases on the order of thousands of hours of audio material. These would reduce the search accuracy and speed, respectively, in ASR-based STD systems. The objective of this thesis is to address these issues. The work presented in this thesis constitutes four major contributions. The first is the development of a fast and accurate ASR-based STD approach for large audio repositories. This approach is based on efficient indexing of ASR outputs and a two-stage phoneme-based search procedure which facilitates detecting occurrences of all query terms, whether they belong to the ASR vocabulary or not. The second contribution is the development of a graph-based approach for verifying the occurrence of query terms in the set of candidate speech intervals derived from an STD system. In this approach, the confidence scores associated with the hypothesized query term occurrences, generated by the original STD system, are adjusted based on the acoustic similarity of the corresponding acoustic intervals to each other and to other intervals in the repository.
The third contribution of this thesis is the use of a feature representation and modeling formalism, distinct from those used in conventional ASR systems, for generating alternative confidence scores for a given set of hypothesized query term occurrences. It is shown that the resulting confidence scores are complementary to the confidence scores estimated in conventional ASR-based STD systems. The fourth contribution is the development of two manifold-based semi-supervised approaches for verifying hypothesized occurrences of query terms. It is demonstrated that deploying unlabeled data in addition to labeled data in training term-dependent models under the proposed semi-supervised framework improves the verification accuracy. Moreover, in extremely low-resource scenarios, reasonably good STD performance is achieved by only exploiting the similarity of the hypothesized query term occurrences using a semi-supervised approach based on graph spectral clustering.
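To make the graph-based verification idea in the abstract above more concrete, the following is a minimal sketch, not the thesis's actual method. It assumes each hypothesized occurrence has an initial confidence score and that a pairwise acoustic similarity matrix is already available; it then smooths the scores over the similarity graph with a simple propagation rule. The function name `propagate_scores` and the parameters `alpha` and `iterations` are hypothetical and chosen only for illustration.

```python
def propagate_scores(scores, similarity, alpha=0.5, iterations=50):
    """Smooth detection confidence scores over a similarity graph.

    scores:     initial confidence per candidate interval (floats in [0, 1]).
    similarity: symmetric matrix (list of lists) of non-negative weights.
    alpha:      weight kept on the original score (hypothetical parameter).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Ignore self-similarity so a node is influenced only by its neighbours.
        row = [similarity[i][j] if j != i else 0.0 for j in range(n)]
        total = sum(row)
        if total > 0:
            weights.append([w / total for w in row])
        else:
            # Isolated node: put all weight on itself so its score is unchanged.
            weights.append([1.0 if j == i else 0.0 for j in range(n)])

    current = list(scores)
    for _ in range(iterations):
        # Each score is a mix of its original value and its neighbours' average.
        current = [
            alpha * scores[i]
            + (1 - alpha) * sum(weights[i][j] * current[j] for j in range(n))
            for i in range(n)
        ]
    return current


# Three candidate intervals: the first two sound alike, the third is unrelated.
initial = [0.9, 0.4, 0.4]
sim = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
adjusted = propagate_scores(initial, sim)
```

In this toy example the second candidate's score is pulled up by its acoustically similar, high-confidence neighbour, while the isolated third candidate keeps its original score. A real system would derive the similarity matrix from acoustic features and would likely use a more principled propagation or spectral method.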